Important
BaseComputer should be considered deprecated, and will be removed as soon as it is possible to do so. With recent changes to templates, there should be no reason to use BaseComputer over Computer. However, if you find such a situation, please open an issue.
Creating a “True” Computer¶
Important
If your computers were created before 0.11.x, it is likely that they need updating. See the dedicated section for info about this.
Unscripted¶
Previous tutorials covered generating scripts from already existing jobscripts.
In this tutorial we will be covering how to generate scripts completely from scratch.
An understanding of your target scheduler will help here, though you may be able to get something functional by tweaking the examples until the errors stop. You can import from remotemanager.connection.computers, where you will find ExampleSlurm and ExampleTorque. These should get you most of the way there, and with the advice later in this tutorial you can use them as a starting point.
Machine Agnosticism¶
The end goal of a Computer object is to provide a machine agnostic interface between what you put in your notebook/dataset and what the machine on the other side is expecting. Most often, HPC systems require a request for resources, which is then granted by a scheduler.
Since the syntax of these requests and the scheduler commands themselves can differ wildly between machines, we need a “middleman” to ensure that workflows are uniform across machines.
That is to say, if you have a dataset specified with mpi=64, omp=4, nodes=1, for example, this will always request those resources regardless of the machine it is run on.
Hence “machine agnostic”. Ideally, a Dataset that runs on one machine should be able to run on a completely different one by swapping out the connection.
Resources¶
We’ll start by creating a Computer that requests some common resources. This is done using a Resource object, which represents one attribute of a machine.
This should make more sense as we progress.
[2]:
from remotemanager import BaseComputer
Just as before, we start from BaseComputer, as it does a lot of the internal work for you.
With templates we were only accessing a fraction of what it can do. By subclassing this object we can access the full suite of features.
To request resources we’ll need our Resource object, so let’s import and explore that first.
[3]:
from remotemanager.connection.computers import Resource
The Resource class¶
Resource is what does all of the conversion work: it specifies how to convert arguments into jobscript strings. As a simple example, let’s create a basic Resource for nodes.
[4]:
nodes = Resource(name="nodes", flag="nodes")
That looks like a lot of repetition, but these inputs are all required. Let’s run through them:
nodes = ... is the initial assignment. It lets us access this resource again. In a computer, you’ll be assigning this to self.nodes, but we’ll get there later.
name = "nodes". We need to let the Resource know its own name; this is important for some of the helper methods available in BaseComputer. A simple rule of thumb is to set this to whatever you assign the object to. So mpi = Resource(name="mpi", ...), etc.
Important
When specifying resources this way, name is required.
flag = "nodes". This is the actual translation. The flag argument dictates what’s put in the jobscript. A slurm scheduler expects #SBATCH --nodes=n, so in the mpi case above the flag would be something like flag="ntasks", for #SBATCH --ntasks=t.
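To make the role of flag concrete, here is a minimal sketch of the translation, written as a standalone function rather than with remotemanager itself. The name render_line and its prefix argument are hypothetical, and the line layout is simply modelled on the #SBATCH --nodes=n form described above.

```python
def render_line(prefix: str, flag: str, value) -> str:
    """Build one resource request line, e.g. '#SBATCH --nodes=2'.

    Hypothetical helper: it only mimics the prefix/flag/value layout
    described in the text and is not remotemanager's own code.
    """
    return f"{prefix} --{flag}={value}"


# name is what you type in Python; flag is what lands in the jobscript
nodes_line = render_line("#SBATCH", "nodes", 2)
mpi_line = render_line("#SBATCH", "ntasks", 16)

print(nodes_line)
print(mpi_line)
```

Keeping name and flag separate is what lets the same Python-side argument (mpi) map to whatever spelling a given scheduler expects (ntasks here).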
With that explained, let’s look at some of the other arguments of Resource.
Optional¶
By default, a Resource is marked as optional. This means that if it is not specified, the Computer will just ignore it when making the script. If you must have this resource specified, you can flag it with optional=False when creating it.
This will make the Computer raise an error when generating the script if the resource has no value, and will also allow it to be queried within the required and missing properties. But more on that later.
[5]:
print(f"Is nodes optional? {nodes.optional}")
nodes = Resource(name="nodes", flag="nodes", optional=False)
print(f"Is nodes optional? {nodes.optional}")
Is nodes optional? True
Is nodes optional? False
Defaults¶
Resources can also specify a default, which is used when no value has been set.
[6]:
nodes = Resource(name="nodes", flag="nodes")
print(nodes.value)
None
[7]:
nodes = Resource(name="nodes", flag="nodes", default=1)
print(nodes.value)
1
min and max¶
Numerical resources can have a specified min and max. When setting the value, a validation step is performed, raising an error if the limits are exceeded.
[8]:
nodes = Resource(name="nodes", flag="nodes", max=64)
nodes.value = 16
[9]:
nodes.value = 128
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[9], line 1
----> 1 nodes.value = 128
File ~/Work/Devel/remotemanager/remotemanager/connection/computers/dynamicvalue.py:348, in DynamicMixin.value(self, value)
346 @value.setter
347 def value(self, value):
--> 348 self.set_value(value)
File ~/Work/Devel/remotemanager/remotemanager/connection/computers/dynamicvalue.py:379, in DynamicMixin.set_value(self, value)
362 def set_value(self, value):
363 """
364 Sets the value, separating out the function allows for property overloading
365
(...)
377 to be a combination of other resources (DV)
378 """
--> 379 self._validate(value)
381 if isinstance(self._value, DynamicValue):
382 # we're setting on an Argument _value
383 if isinstance(value, DynamicValue):
384 # if _value has any extra properties,
385 # need to be careful not to drop them
File ~/Work/Devel/remotemanager/remotemanager/connection/computers/dynamicvalue.py:423, in DynamicMixin._validate(self, value)
419 raise ValueError(
420 f"{value}{nameinsert} is less than minimum value {self.min}"
421 )
422 if self.max is not None and value > self.max:
--> 423 raise ValueError(
424 f"{value}{nameinsert} is more than maximum value {self.max}"
425 )
ValueError: 128 for nodes is more than maximum value 64
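The check visible at the bottom of this traceback can be reproduced in isolation. The following is a minimal sketch of the same kind of limit validation as a free function; validate and its parameter names are hypothetical, and only the error wording is borrowed from the message above.

```python
from typing import Optional


def validate(value: float, name: str,
             minimum: Optional[float] = None,
             maximum: Optional[float] = None) -> float:
    """Return value unchanged if it lies within the limits, else raise.

    Illustrative stand-in only, not remotemanager's internal _validate.
    """
    if minimum is not None and value < minimum:
        raise ValueError(f"{value} for {name} is less than minimum value {minimum}")
    if maximum is not None and value > maximum:
        raise ValueError(f"{value} for {name} is more than maximum value {maximum}")
    return value


accepted = validate(16, "nodes", maximum=64)  # within the limit

try:
    validate(128, "nodes", maximum=64)  # exceeds the limit
except ValueError as exc:
    message = str(exc)
    print(message)
```

A limit left as None is skipped, which is why a Resource with only max set never complains about small values.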
Formatting¶
The format keyword allows some formatting to be done before outputting. The available formats are:
float: By default, any value that evaluates to a float will be cast to an int type (to prevent issues like nodes=16.0). Set format="float" to avoid this.
time: This format allows automatic conversion of time={nSeconds} to a hh:mm:ss format.
[10]:
int_resource = Resource(name="int", flag="int")
flt_resource = Resource(name="flt", flag="flt", format="float")
int_resource.value = 20/5 # This will be converted to an int
flt_resource.value = 20/5 # Specifying format="float" will return this as-is
[11]:
print(int_resource.value)
4
[12]:
print(flt_resource.value)
4.0
[13]:
time_resource = Resource(name="time", flag="walltime", format="time")
time_resource.value = 86400 # 24h in seconds
[14]:
print(time_resource.value)
24:00:00
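For reference, the seconds to hh:mm:ss conversion is plain integer arithmetic. Below is a small standalone sketch of it; seconds_to_hms is a hypothetical helper written for this illustration, and remotemanager’s own formatter may differ in detail (for example in how it treats days).

```python
def seconds_to_hms(seconds: int) -> str:
    """Convert a number of seconds to an hh:mm:ss string."""
    hours, remainder = divmod(int(seconds), 3600)  # whole hours first
    minutes, secs = divmod(remainder, 60)          # then minutes and seconds
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


day = seconds_to_hms(86400)   # 24 hours
hour = seconds_to_hms(3600)   # 1 hour
odd = seconds_to_hms(3725)    # 1 hour, 2 minutes, 5 seconds

print(day, hour, odd)
```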
Computer Creation¶
Now that we have an understanding of what the Resource class does, we can put it into a computer.
Let’s create a very simple slurm scheduler interface. Start by creating your class, subclassing BaseComputer. You should always make *args and **kwargs available, and pass them to super().__init__(*args, **kwargs).
This is what grants your Computer all of the functionality of both BaseComputer and URL.
[15]:
class Computer(BaseComputer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
Important
All computers should start this way; missing the super() call will cause strange behaviour.
This class will “function” as a URL at the very least. You can give it a user and hostname and it will behave as expected. But this isn’t very useful for us, since we need to access the scheduler. Let’s tell it how to do that:
[16]:
class Computer(BaseComputer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.submitter = "sbatch"
        self.shebang = "#!/bin/bash"
        self.pragma = "#SBATCH"
submitter¶
Previously, we had to specify the submitter when generating our connection.
However, when generating a class version, we can specify it internally, eliminating the need to set it each time you access this machine.
We can additionally set the shebang, if that needs to be specified. It defaults to #!/bin/bash, though we have set it here for clarity.
The pragma is a special string that defines the beginning of a resource line: #SBATCH in slurm, for example.
Defining Resource Requests¶
Now the base URL knows how to submit a job, what the shebang should be, and the prefix to the resource requests.
Let’s add some resources.
[17]:
class Computer(BaseComputer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.submitter = "sbatch"
        self.shebang = "#!/bin/bash"
        self.pragma = "#SBATCH"
        self.mpi = Resource(name="mpi", flag="ntasks", min=1)
        self.omp = Resource(name="omp", flag="cpus-per-task", min=1, max=64)
        self.nodes = Resource(name="nodes", flag="nodes", optional=False)
        self.time = Resource(name="time", flag="walltime", optional=False, format="time", default=3600)
        # Note this addition, it will be optional, and we won't specify it.
        # This will stop it from appearing in the output
        self.test = Resource(name="test", flag="test")
Note
Remember that these objects are arbitrary and unlimited. A good practice is to go through some jobscripts and create a Resource object for each request you see, noting that flag is whatever you see in the jobscript, and name can be whatever you want.
Let’s see what sort of script this generates:
[18]:
test = Computer()
print(test.script())
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[18], line 3
1 test = Computer()
----> 3 print(test.script())
File ~/Work/Devel/remotemanager/remotemanager/connection/computers/base.py:893, in BaseComputer.script(self, **kwargs)
890 self.argument_dict[key].temporary_value = val
892 if not self.valid:
--> 893 raise RuntimeError(f"missing required arguments: {self.missing}")
895 if self.template is None:
896 logger.debug("Creating script from Resources")
RuntimeError: missing required arguments: ['nodes']
Great, an error.
If we look at the definition again we can see what happened.
We defined nodes as non-optional, but then didn’t specify it. So when the Computer went to generate the script, it raised an error instead.
The error message lists the missing arguments to fill in. If we include those (and a few others), we’ll see a better output.
Note
We also specified time as non-optional, but provided a default. Since it can fall back to its default value, it was not mentioned in the error output.
[19]:
test.mpi = 4
test.omp = 4
test.nodes = 1
print(test.script())
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4
#SBATCH --nodes=1
#SBATCH --walltime=01:00:00
Et voilà! We specified some resources, and they were converted into resource request lines as specified in the Computer and Resource.
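As a rough mental model of how such a script is assembled, the sketch below builds the same output from a shebang, a pragma and a list of flag/value pairs, leaving out any resource that has no value. The function build_script and its data layout are assumptions made for illustration and are not how BaseComputer is implemented internally.

```python
def build_script(shebang, pragma, resources):
    """Assemble a jobscript header.

    resources is a list of (flag, value) pairs; a value of None means
    "not set", and such resources are skipped entirely.
    Hypothetical helper for illustration only.
    """
    lines = [shebang]
    for flag, value in resources:
        if value is None:
            continue  # unset resources never reach the script
        lines.append(f"{pragma} --{flag}={value}")
    return "\n".join(lines)


script = build_script(
    "#!/bin/bash",
    "#SBATCH",
    [
        ("ntasks", 4),
        ("cpus-per-task", 4),
        ("nodes", 1),
        ("walltime", "01:00:00"),
        ("test", None),  # defined but never given a value
    ],
)
print(script)
```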
When using a Computer in a workflow, you are not limited to setting the values at the Computer level.
Important
A Resource named “test” was added for this example, but is missing from the output. Just like with Templates, resources without arguments will not be added. This means that it’s preferable to over-provision with Resource objects, rather than exactly meet your targets.
Note
We covered the specifics of this in the template tutorial.
[20]:
from remotemanager import Dataset
def f():
    return
url = Computer()
ds = Dataset(f, url=url, skip=False)
ds.set_run_arg("omp", 4)
ds.set_run_arg("mpi", 64)
ds.set_run_arg("nodes", 8)
print(ds.script)
#!/bin/bash
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=4
#SBATCH --nodes=8
#SBATCH --walltime=01:00:00
#SUBMISSION_SUBSTITUTION#
We can also query what is available (and required):
[21]:
test.arguments
[21]:
['mpi', 'nodes', 'omp', 'test', 'time']
[22]:
test.required
[22]:
['nodes']
[23]:
test.missing
[23]:
[]
Note
The missing property here is exactly what is output in the error message we saw above.
valid¶
Instead of checking missing, or waiting for an error, you can also check the valid property of a Computer.
This returns True if the Computer thinks that it will produce a working jobscript, and False if not.
Note
This is the same condition checked when raising the RuntimeError for missing values.
[24]:
test = Computer()
test.valid
[24]:
False
[25]:
test.missing
[25]:
['nodes']
[26]:
test.nodes = 4
test.valid
[26]:
True
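The relationship between required, missing and valid can be summarised with a toy model. The sketch below is an assumption based purely on the outputs shown above (time is non-optional but has a default, and does not appear in required); ToyResource and the three helper functions are hypothetical and are not part of remotemanager.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ToyResource:
    """Minimal stand-in for a Resource: just the fields this sketch needs."""
    optional: bool = True
    default: Any = None
    value: Any = None


def required(resources: Dict[str, ToyResource]) -> List[str]:
    # Non-optional resources that cannot fall back on a default
    return sorted(
        name for name, res in resources.items()
        if not res.optional and res.default is None
    )


def missing(resources: Dict[str, ToyResource]) -> List[str]:
    # Required resources that still have no value
    return [name for name in required(resources) if resources[name].value is None]


def valid(resources: Dict[str, ToyResource]) -> bool:
    # Valid means nothing is missing
    return not missing(resources)


resources = {
    "mpi": ToyResource(),
    "nodes": ToyResource(optional=False),
    "time": ToyResource(optional=False, default=3600),
}

before = (required(resources), missing(resources), valid(resources))
resources["nodes"].value = 4
after = (required(resources), missing(resources), valid(resources))
print(before)
print(after)
```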
Dynamic Values¶
Added in version 0.11.0.
Values can also depend on other values. This allows you to calculate based on other inputs. An obvious use case for this is calculating the nodes request.
Let’s assume we have a machine that has 128 cores per node; we can specify the value this way:
[27]:
test = Computer()
test.mpi = 128
test.omp = 4
test.nodes = test.mpi * test.omp / 128
test.time = 3600
Since each node has 128 cores, we calculate the total number of cores requested (mpi * omp), then divide by 128 to get the number of nodes needed to hold them.
[28]:
print(test.script())
#!/bin/bash
#SBATCH --ntasks=128
#SBATCH --cpus-per-task=4
#SBATCH --nodes=4
#SBATCH --walltime=01:00:00
Here the value comes out to 4, as we’d expect. Since we did not specify format="float", the value is rounded up and cast to an int.
Rounding up covers the case where the request does not fill a whole number of nodes: we should still request enough nodes to cover it.
Let’s change mpi to 112. Since 112 * 4 / 128 = 3.5 and we can’t have 3.5 nodes, we need to request 4 nodes:
[29]:
test.mpi = 112
print(test.script())
#!/bin/bash
#SBATCH --ntasks=112
#SBATCH --cpus-per-task=4
#SBATCH --nodes=4
#SBATCH --walltime=01:00:00
Note
What happens to this mismatched number of cores and nodes is up to the scheduler.
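The arithmetic behind both scripts above can be checked by hand. This sketch assumes the rounding behaves like math.ceil, which matches the outputs shown; nodes_needed is a hypothetical helper and not part of remotemanager.

```python
import math


def nodes_needed(mpi: int, omp: int, cores_per_node: int = 128) -> int:
    """Number of whole nodes needed for mpi tasks with omp cores each."""
    return math.ceil(mpi * omp / cores_per_node)


full = nodes_needed(128, 4)     # 512 / 128 = 4.0, an exact fit
partial = nodes_needed(112, 4)  # 448 / 128 = 3.5, rounded up

print(full, partial)
```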
DynamicValues as Defaults¶
Added in version 0.11.2.
Setting values to be dynamic can be very useful for automated parameterisation, but you can go one step further and set them as defaults.
Warning
This is a complex feature, and it’s likely that there are still some edge cases yet to be found. If your Computer definition exhibits strange behaviour please file a bug report. Even if it’s not a bug, it’s likely an issue with the documentation.
Let’s generate a simple computer to exhibit the automatic nodes behaviour by default:
[30]:
class Computer(BaseComputer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.cores_per_node = 128
        self.submitter = "sbatch"
        self.shebang = "#!/bin/bash"
        self.pragma = "#SBATCH"
        self.mpi = Resource(name="mpi", flag="ntasks")
        self.omp = Resource(name="omp", flag="cpus-per-task")
        self.nodes = Resource(name="nodes", flag="nodes", default=self.mpi * self.omp / self.cores_per_node)
Now when we ask for a script without specifying nodes, it is calculated for us.
[31]:
test = Computer()
test.mpi = 128
test.omp = 4
print(test.script())
#!/bin/bash
#SBATCH --ntasks=128
#SBATCH --cpus-per-task=4
#SBATCH --nodes=4
And just to hammer the point home, if we double the mpi request, we’ll see that the request for nodes also increases automatically.
[32]:
test.mpi = 256
test.omp = 4
print(test.script())
#!/bin/bash
#SBATCH --ntasks=256
#SBATCH --cpus-per-task=4
#SBATCH --nodes=8
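The key property of a dynamic default is that it is evaluated when the value is read, not when it is defined. The toy class below imitates that behaviour with an ordinary Python property; ToyComputer is hypothetical, and it assumes that an explicitly set value takes priority over the computed default.

```python
import math


class ToyComputer:
    """Toy model of a computer whose nodes default follows mpi and omp."""

    def __init__(self, cores_per_node: int = 128):
        self.cores_per_node = cores_per_node
        self.mpi = 1
        self.omp = 1
        self._nodes = None  # None means "fall back to the computed default"

    @property
    def nodes(self) -> int:
        if self._nodes is not None:
            return self._nodes  # assumed: an explicit value wins
        # Recomputed on every read, so it tracks later changes to mpi/omp
        return math.ceil(self.mpi * self.omp / self.cores_per_node)

    @nodes.setter
    def nodes(self, value: int) -> None:
        self._nodes = value


toy = ToyComputer()
toy.mpi, toy.omp = 128, 4
first = toy.nodes     # computed from 128 * 4 / 128

toy.mpi = 256
second = toy.nodes    # follows the new mpi value

toy.nodes = 2
explicit = toy.nodes  # the explicit value replaces the default

print(first, second, explicit)
```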
Adding to Python objects¶
While it is safe to add a Python intrinsic to a resource, the reverse is not true.
These resource objects can be “chained” into each other because they provide special methods to handle the operators.
When the standard object comes first, these operator modifications are not present, so you’ll get an error.
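This asymmetry is general Python behaviour and can be shown without remotemanager. The sketch below uses a hypothetical Box class that defines only the left-hand + operator; it is a toy illustration of the mechanism, not the library’s DynamicValue.

```python
class Box:
    """Toy wrapper that defines only the left-hand '+' operator."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        # Called for `box + other`; unwrap other if it is also a Box
        other_value = other.value if isinstance(other, Box) else other
        return Box(self.value + other_value)


box = Box(10)
forward = (box + 5).value  # handled by Box.__add__

try:
    5 + box  # int does not know about Box, and Box defines no __radd__
    reverse_failed = False
except TypeError:
    reverse_failed = True

print(forward, reverse_failed)
```

With the wrapper on the left, its own method runs; with the plain object on the left, Python asks the plain object first, and it has no idea how to combine itself with the wrapper.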
Solutions¶
There are two solutions to this issue. The first is to create a Resource for the variable.
This is the “heavy” approach, and allows this variable to be tweaked and changed.
However, this can rapidly spiral out of control, and some variables simply don’t need to have a hook.
For these, there is concat_basic. This performs the addition for you, ensuring that the returned object is properly linked.
[33]:
from remotemanager.connection.computers import concat_basic
value = Resource(name="value", default=10)
string_prefix = Resource(name="strprefix", default=concat_basic("test_", value))
print(string_prefix.value)
test_10
This Script Format Doesn’t Fit my Scheduler¶
The scripts that have been generated thus far in the tutorials have been of one single format.
Some schedulers have more specific requirements, so need to be treated differently.
The tutorial will cover the details of how these scripts are generated, which will allow you to personalise them much more deeply.